Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core][Enable gcs scheduler 4/n] Fix PG scheduling by gcs scheduler #27084

Merged
merged 10 commits into from
Aug 29, 2022

Conversation

Chong-Li
Copy link
Contributor

@Chong-Li Chong-Li commented Jul 27, 2022

Why are these changes needed?

This the first split PR of #25075, which tried to enable gcs scheduler by default.

This split PR mainly includes:

  1. In GcsPlacementGroupScheduler::CommitBundleResources() and ReturnBundleResources(), we have to trigger pending actors (in gcs) because resources have been updated.

  2. Still in the above two functions, we have to update PG wildcard resources in a special way. A PG's wildcard resources (in a certain node) has to be the sum of all related bundle resources. Even though CommitBundleResources() uses ToNodeBundleResourcesMap() to sum up bundle resources, it does not handle the scenario that a single (or subset) bundle is rescheduled, in which this single bundle's wildcard resources would wrongly override the existing one. (see test_placement_group_reschedule_when_node_dead for such a scenario).

  3. Fix the remaining issues from ([Core][Enable gcs scheduler 3/n] integrate placement group with gcs scheduler #24842 (comment)).

Related issue number

#25075, #24842

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@Chong-Li Chong-Li added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 28, 2022
@rkooo567 rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 28, 2022
src/ray/gcs/gcs_server/gcs_placement_group_scheduler.cc Outdated Show resolved Hide resolved
src/ray/raylet/scheduling/scheduling_ids.h Outdated Show resolved Hide resolved
src/ray/raylet/scheduling/scheduling_ids.h Outdated Show resolved Hide resolved
src/ray/raylet/scheduling/scheduling_ids.h Outdated Show resolved Hide resolved
src/ray/raylet/scheduling/scheduling_ids.h Outdated Show resolved Hide resolved
src/ray/gcs/gcs_server/gcs_placement_group_scheduler.cc Outdated Show resolved Hide resolved
src/ray/gcs/gcs_server/gcs_placement_group_scheduler.cc Outdated Show resolved Hide resolved
Chong-Li added 3 commits August 16, 2022 15:55
Signed-off-by: Chong-Li <[email protected]>
@Chong-Li
Copy link
Contributor Author

Do you have any more suggestions to this PR? @rkooo567 @iycheng

@scv119 scv119 merged commit 88a4114 into ray-project:master Aug 29, 2022
@jovany-wang jovany-wang deleted the refactor_gcs_sched_4/n branch August 29, 2022 16:40
XiaodongLv pushed a commit to XiaodongLv/ray that referenced this pull request Sep 2, 2022
…ay-project#27084)

This the first split PR of ray-project#25075, which tried to enable gcs scheduler by default.

This split PR mainly includes:

In GcsPlacementGroupScheduler::CommitBundleResources() and ReturnBundleResources(), we have to trigger pending actors (in gcs) because resources have been updated.

Still in the above two functions, we have to update PG wildcard resources in a special way. A PG's wildcard resources (in a certain node) has to be the sum of all related bundle resources. Even though CommitBundleResources() uses ToNodeBundleResourcesMap() to sum up bundle resources, it does not handle the scenario that a single (or subset) bundle is rescheduled, in which this single bundle's wildcard resources would wrongly override the existing one. (see test_placement_group_reschedule_when_node_dead for such a scenario).

Fix the remaining issues from ([Core][Enable gcs scheduler 3/n] integrate placement group with gcs scheduler ray-project#24842 (comment)).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants